Back in 2014, fivethirtyeight.com published an article on alcohol consumption in different countries. The data set drinks is available as part of the fivethirtyeight package. Make sure you have installed the fivethirtyeight package before proceeding.
library(fivethirtyeight)
data(drinks)
# or download directly
# alcohol_direct <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv")

What are the variable types? Any missing values we should worry about?
skimr::skim(drinks)

| Name | drinks |
| Number of rows | 193 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| country | 0 | 1 | 3 | 28 | 0 | 193 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| beer_servings | 0 | 1 | 106.16 | 101.14 | 0 | 20.0 | 76.0 | 188.0 | 376.0 | ▇▃▂▂▁ |
| spirit_servings | 0 | 1 | 80.99 | 88.28 | 0 | 4.0 | 56.0 | 128.0 | 438.0 | ▇▃▂▁▁ |
| wine_servings | 0 | 1 | 49.45 | 79.70 | 0 | 1.0 | 8.0 | 59.0 | 370.0 | ▇▁▁▁▁ |
| total_litres_of_pure_alcohol | 0 | 1 | 4.72 | 3.77 | 0 | 1.3 | 4.2 | 7.2 | 14.4 | ▇▃▅▃▁ |
alcohol_data <- drinks

TEAM ANSWER:
The variable types are character and numeric: 4 numeric variables and 1 character variable (country). There are no missing values.
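As a quick sanity check on the skim output, missing values can also be counted directly. A minimal base-R sketch, assuming the fivethirtyeight package is installed as above:

```r
library(fivethirtyeight)
data(drinks)

# Missing values per column -- all zeros, matching skim's n_missing column
colSums(is.na(drinks))

# One overall check: FALSE means no NAs anywhere in the data
any(is.na(drinks))
```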
Make a plot that shows the top 25 beer consuming countries
top25_beer <- drinks %>%
slice_max(., order_by = beer_servings, n = 25)
ggplot(top25_beer,aes(y = reorder(country, beer_servings), x = beer_servings))+
geom_col()+
labs(x = "Beer Servings",
y = "Country",
title = "Top 25 Beer Consuming Countries")+
theme_bw()+
NULL

Make a plot that shows the top 25 wine consuming countries
# YOUR CODE GOES HERE
top25_wine<- drinks %>%
slice_max(.,order_by = wine_servings,n=25)
ggplot(top25_wine, aes(y = reorder(country, wine_servings), x = wine_servings ))+
geom_col()+
labs(x = "Wine Servings",
y = "Country",
title = "Top 25 Wine Consuming Countries")+
theme_bw()+
NULL

Finally, make a plot that shows the top 25 spirit consuming countries
# YOUR CODE GOES HERE
top25_spirit<- drinks %>%
slice_max(.,order_by = spirit_servings,n=25)
ggplot(top25_spirit, aes(y = reorder(country, spirit_servings), x = spirit_servings ))+
geom_col()+
labs(x = "Spirit Servings",
y = "Country",
title = "Top 25 Spirit Consuming Countries")+
theme_bw()+
NULL

What can you infer from these plots? Don’t just explain what’s in the graph, but speculate or tell a short story (1-2 paragraphs max).
TEAM ANSWER:
The countries at the top of each list tend to be large producers of the given alcoholic beverage. A historic national connection to the drink, and national pride in its production, seem to influence which type of alcohol people consume. For example, in Namibia the brewing industry is regarded as a source of national pride.
We will look at a subset of movies, taken from the Kaggle IMDB 5000 movie dataset
movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)

## Rows: 2,961
## Columns: 11
## $ title <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:
- gross: The gross earnings in the US box office, not adjusted for inflation
- budget: The movie’s budget
- cast_facebook_likes: the number of Facebook likes cast members received
- votes: the number of people who voted for (or rated) the movie in IMDB
- reviews: the number of reviews for that movie
- rating: IMDB average rating

Produce a table with the average gross earning and budget (gross and budget) by genre. Calculate a variable return_on_budget which shows how many $ a movie made at the box office for each $ of its budget. Rank genres by this return_on_budget in descending order.

skimr::skim(movies) # check summary of the dataset

| Name | movies |
| Number of rows | 2961 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| title | 0 | 1 | 1 | 83 | 0 | 2907 | 0 |
| genre | 0 | 1 | 5 | 11 | 0 | 17 | 0 |
| director | 0 | 1 | 3 | 32 | 0 | 1366 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2.00e+03 | 9.95e+00 | 1920.0 | 2.00e+03 | 2.00e+03 | 2.01e+03 | 2.02e+03 | ▁▁▁▂▇ |
| duration | 0 | 1 | 1.10e+02 | 2.22e+01 | 37.0 | 9.50e+01 | 1.06e+02 | 1.19e+02 | 3.30e+02 | ▃▇▁▁▁ |
| gross | 0 | 1 | 5.81e+07 | 7.25e+07 | 703.0 | 1.23e+07 | 3.47e+07 | 7.56e+07 | 7.61e+08 | ▇▁▁▁▁ |
| budget | 0 | 1 | 4.06e+07 | 4.37e+07 | 218.0 | 1.10e+07 | 2.60e+07 | 5.50e+07 | 3.00e+08 | ▇▂▁▁▁ |
| cast_facebook_likes | 0 | 1 | 1.24e+04 | 2.05e+04 | 0.0 | 2.24e+03 | 4.60e+03 | 1.69e+04 | 6.57e+05 | ▇▁▁▁▁ |
| votes | 0 | 1 | 1.09e+05 | 1.58e+05 | 5.0 | 1.99e+04 | 5.57e+04 | 1.33e+05 | 1.69e+06 | ▇▁▁▁▁ |
| reviews | 0 | 1 | 5.03e+02 | 4.94e+02 | 2.0 | 1.99e+02 | 3.64e+02 | 6.31e+02 | 5.31e+03 | ▇▁▁▁▁ |
| rating | 0 | 1 | 6.39e+00 | 1.05e+00 | 1.6 | 5.80e+00 | 6.50e+00 | 7.10e+00 | 9.30e+00 | ▁▁▆▇▁ |
unique(movies) # check number of unique entries

## # A tibble: 2,961 x 11
## title genre director year duration gross budget cast_facebook_l… votes
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Avatar Action James C… 2009 178 7.61e8 2.37e8 4834 8.86e5
## 2 Titanic Drama James C… 1997 194 6.59e8 2 e8 45223 7.93e5
## 3 Jurassi… Action Colin T… 2015 124 6.52e8 1.5 e8 8458 4.18e5
## 4 The Ave… Action Joss Wh… 2012 173 6.23e8 2.2 e8 87697 9.95e5
## 5 The Dar… Action Christo… 2008 152 5.33e8 1.85e8 57802 1.68e6
## 6 Star Wa… Action George … 1999 136 4.75e8 1.15e8 37723 5.35e5
## 7 Star Wa… Action George … 1977 125 4.61e8 1.1 e7 13485 9.11e5
## 8 Avenger… Action Joss Wh… 2015 141 4.59e8 2.5 e8 92000 4.63e5
## 9 The Dar… Action Christo… 2012 164 4.48e8 2.5 e8 106759 1.14e6
## 10 Shrek 2 Adven… Andrew … 2004 93 4.36e8 1.5 e8 1148 3.15e5
## # … with 2,951 more rows, and 2 more variables: reviews <dbl>, rating <dbl>
TEAM ANSWER TO: Are there any missing values (NAs)? Are all entries distinct or are there duplicate entries?
There are no missing values and no duplicate entries; however, the same titles appear multiple times, which is indicative of movie remakes.
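This can be backed up with a quick check, sketched here under the assumption that the movies data frame loaded above is in scope (distinct() and count() come from dplyr):

```r
library(dplyr)

# 0 means no fully duplicated rows
nrow(movies) - nrow(distinct(movies))

# Titles that appear more than once -- candidates for remakes
movies %>%
  count(title, sort = TRUE) %>%
  filter(n > 1)
```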
# number of movies by genre in descending order
movies_by_genre <- movies %>%
select(genre)%>%
count(genre, sort=TRUE, name = "NumOfMovies")
# return on budget
average_gross <- movies%>%
group_by(genre)%>%
summarise(avg_budget = mean(budget),
avg_gross = mean(gross))%>%
mutate(returnBudget = (avg_gross/avg_budget)-1)%>%
arrange(desc(returnBudget))
# top15 director by gross
top15_gDirectors <- movies%>%
group_by(director)%>%
summarise(total_gross = sum(gross),
avg_gross = mean(gross),
std_gross = sd(gross),
median_gross = median(gross))%>%
slice_max(.,order_by = total_gross, n=15)
#rating by Genre
ratingByGenre <- movies%>%
group_by(genre)%>%
summarise(mean_rating = mean(rating),
max_rating = max(rating),
min_rating = min(rating),
sd_rating = sd(rating),
median_rating = median(rating))
ggplot(movies, aes(y = rating, x = reorder(genre,-rating)))+
geom_boxplot()+
labs(x = "Genre",
y = "Rating",
title = "Ratings by Genre")+
theme_bw()+
theme(axis.text.x = element_text(angle = 45, hjust = 0.8))+
NULL

Use ggplot to answer the following:

Examine the relationship between gross and cast_facebook_likes. Produce a scatterplot and write one sentence discussing whether the number of facebook likes that the cast has received is likely to be a good predictor of how much money a movie will make at the box office. What variable are you going to map to the Y- and X- axes?

ggplot(movies, aes(x = cast_facebook_likes, y = gross))+
geom_point()+
scale_x_log10()+
scale_y_log10()+
geom_smooth(method = "lm")+
labs(x = "Number of Facebook Likes of the Cast Members",
y = "Gross Box Office Revenue",
title = "Relationship between Cast Facebook Likes and Gross Box Office Revenue")+
theme_bw()+
NULL

sprintf("The correlation between cast facebook likes and gross is: %f", cor(movies$cast_facebook_likes, movies$gross))

## [1] "The correlation between cast facebook likes and gross is: 0.213161"
TEAM ANSWER:
The relationship between gross and cast_facebook_likes is positive, but the correlation of 0.213 is not strong, meaning cast facebook likes are not a good predictor of gross.
Examine the relationship between gross and budget. Produce a scatterplot and write one sentence discussing whether budget is likely to be a good predictor of how much money a movie will make at the box office.

ggplot(movies, aes(x=budget, y = gross))+
geom_point()+
geom_smooth(method = "lm")+
scale_x_log10()+
scale_y_log10()+
labs(x = "Budget of the Movie",
y = "Box Office Gross Revenue",
title = "Relationship between Budget and Box Office Revenue")+
theme_bw()+
NULL

sprintf("Correlation between budget and gross: %f", cor(movies$budget, movies$gross))

## [1] "Correlation between budget and gross: 0.640834"
TEAM ANSWER:
The correlation between a movie’s budget and its gross revenue at the box office (0.64) is stronger than that between the cast’s facebook likes and revenue, so budget is a better predictor of gross, but the correlation is still below 0.7.
Examine the relationship between gross and rating. Produce a scatterplot, faceted by genre, and discuss whether IMDB ratings are likely to be a good predictor of how much money a movie will make at the box office. Is there anything strange in this dataset?

ggplot(movies,aes(x = rating, y = gross))+
geom_point()+
geom_smooth(method="lm")+
scale_y_log10()+
facet_wrap(~genre)+
labs(x = "Ratings",
y = "Gross Box Office Revenue",
title = "Relationship between Ratings and Box Office Revenue by Genre")+
theme_bw()+
NULL

TEAM ANSWER:
IMDB ratings are a weak predictor of a movie’s success at the box office: in most facets the best-fit line is almost horizontal. The dataset is also unbalanced, in the sense that Action movies are over-represented while Thriller, Musical, Romance, Western and Family barely have any data points. Using it to learn about tendencies by genre would lead to incorrect statistical conclusions, and it would produce biased machine learning models for predicting movie success.
You may find useful the material on finance data sources.
We will use the tidyquant package to download historical data of stock prices, calculate returns, and examine the distribution of returns.
We must first identify which stocks we want to download data for, and for this we must know their ticker symbol; Apple is known as AAPL, Microsoft as MSFT, McDonald’s as MCD, etc. The file nyse.csv contains 508 stocks listed on the NYSE, their ticker symbol, name, the IPO (Initial Public Offering) year, and the sector and industry the company is in.
nyse <- read_csv(here::here("data","nyse.csv"))
glimpse(nyse)

## Rows: 508
## Columns: 6
## $ symbol <chr> "MMM", "ABB", "ABT", "ABBV", "ACN", "AAP", "AFL", "A", "…
## $ name <chr> "3M Company", "ABB Ltd", "Abbott Laboratories", "AbbVie …
## $ ipo_year <chr> "n/a", "n/a", "n/a", "2012", "2001", "n/a", "n/a", "1999…
## $ sector <chr> "Health Care", "Consumer Durables", "Health Care", "Heal…
## $ industry <chr> "Medical/Dental Instruments", "Electrical Products", "Ma…
## $ summary_quote <chr> "https://www.nasdaq.com/symbol/mmm", "https://www.nasdaq…
Based on this dataset, create a table and a bar plot that shows the number of companies per sector, in descending order
# YOUR CODE GOES HERE
companiesPerSector <- nyse%>%
group_by(sector)%>%
count(sort = TRUE, name = "numOfComp")
ggplot(companiesPerSector, aes(x = numOfComp, y = reorder(sector,numOfComp)))+
geom_col()+
labs(x = "Number of Companies",
y = "Sector",
title = "Number of Companies in each Sector within the Dataset")+
theme_bw()+
NULL

Next, let’s choose some stocks and their ticker symbols and download some data. You MUST choose 6 different stocks from the ones listed below; You should, however, add SPY which is the SP500 ETF (Exchange Traded Fund).
# Notice the cache=TRUE argument in the chunk options. Because getting data is time consuming,
# cache=TRUE means that once it downloads data, the chunk will not run again next time you knit your Rmd
myStocks <- c("AAPL","DIS","DPZ","ANF","TSLA","SPY" ) %>%
tq_get(get = "stock.prices",
from = "2011-01-01",
to = "2021-08-31") %>%
group_by(symbol)
glimpse(myStocks) # examine the structure of the resulting data frame

## Rows: 16,098
## Columns: 8
## Groups: symbol [6]
## $ symbol <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL…
## $ date <date> 2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06, 2011-01-07, …
## $ open <dbl> 11.6, 11.9, 11.8, 12.0, 11.9, 12.1, 12.3, 12.3, 12.3, 12.4, 1…
## $ high <dbl> 11.8, 11.9, 11.9, 12.0, 12.0, 12.3, 12.3, 12.3, 12.4, 12.4, 1…
## $ low <dbl> 11.6, 11.7, 11.8, 11.9, 11.9, 12.0, 12.1, 12.2, 12.3, 12.3, 1…
## $ close <dbl> 11.8, 11.8, 11.9, 11.9, 12.0, 12.2, 12.2, 12.3, 12.3, 12.4, 1…
## $ volume <dbl> 4.45e+08, 3.09e+08, 2.56e+08, 3.00e+08, 3.12e+08, 4.49e+08, 4…
## $ adjusted <dbl> 10.1, 10.2, 10.2, 10.2, 10.3, 10.5, 10.5, 10.6, 10.6, 10.7, 1…
Financial performance analysis depends on returns: if I buy a stock today for $100 and sell it tomorrow for $101.75, my one-day return, assuming no transaction costs, is 1.75%. So, given the adjusted closing prices, our first step is to calculate daily and monthly returns.
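As a minimal worked example of that arithmetic (plain base R; the figures are the ones from the sentence above), a simple return divides the price change by the starting price, while the log return used for the daily series below takes the log of the price ratio:

```r
p0 <- 100      # purchase price today
p1 <- 101.75   # sale price tomorrow

# Arithmetic (simple) return: (P1 - P0) / P0
simple_return <- (p1 - p0) / p0   # 0.0175, i.e. 1.75%

# Log return: log(P1 / P0); nearly identical for small moves
log_return <- log(p1 / p0)
```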
#calculate daily returns
myStocks_returns_daily <- myStocks %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "daily",
type = "log",
col_rename = "daily_returns")
#calculate monthly returns
myStocks_returns_monthly <- myStocks %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "monthly",
type = "arithmetic",
col_rename = "monthly_returns")
#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
group_by(symbol) %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "yearly",
type = "arithmetic",
col_rename = "yearly_returns")

Create a table where you summarise monthly returns for each of the stocks and SPY; min, max, median, mean, SD.
# YOUR CODE GOES HERE
monthlyReturnSummary <- myStocks_returns_monthly%>%
summarise(monthly_max = max(monthly_returns),
monthly_min = min(monthly_returns),
monthly_mean = mean(monthly_returns),
monthly_median = median(monthly_returns),
monthly_sd = sd(monthly_returns))

Plot a density plot, using geom_density(), for each of the stocks
# YOUR CODE GOES HERE
ggplot(myStocks_returns_monthly, aes(monthly_returns))+
geom_density()+
facet_wrap(~symbol)+
labs(x = "Distribution of Monthly Return",
y = "Probability % ",
title = "Probability Distribution of Monthly Returns")+
theme_bw()+
NULL

What can you infer from this plot? Which stock is the riskiest? The least risky?
TEAM ANSWER:
Tesla is the riskiest stock, as it has the widest spread, meaning its monthly returns have the highest standard deviation. The least risky is SPY, as it has the narrowest spread.
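The visual impression can be confirmed numerically; a sketch assuming the monthlyReturnSummary table computed above is in scope:

```r
library(dplyr)

# Order stocks from riskiest to least risky by SD of monthly returns
monthlyReturnSummary %>%
  select(symbol, monthly_sd, monthly_mean) %>%
  arrange(desc(monthly_sd))
```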
Finally, make a plot that shows the expected monthly return (mean) of a stock on the Y axis and the risk (standard deviation) in the X-axis. Please use ggrepel::geom_text_repel() to label each stock
# YOUR CODE GOES HERE
library(ggrepel)
ggplot(monthlyReturnSummary, aes(x = monthly_sd, y = monthly_mean))+
geom_point(size = 4, aes(color = symbol))+
ggrepel::geom_text_repel(aes(label = symbol), size = 4)+
labs(x = "Standard Deviation of Monthly Returns",
y = "Average Monthly Returns",
title = "Risk vs Potential Return")+
theme_bw()+
NULL

What can you infer from this plot? Are there any stocks which, while being riskier, do not have a higher expected return?
TEAM ANSWER:
ANF is much riskier than most, while delivering the lowest returns.
For this task, you will analyse a data set on Human Resource Analytics. The IBM HR Analytics Employee Attrition & Performance data set is a fictional data set created by IBM data scientists. Among other things, the data set includes employees’ income, their distance from work, their position in the company, their level of education, etc. A full description can be found on the website.
First let us load the data
hr_dataset <- read_csv(here::here("data", "datasets_1067_1925_WA_Fn-UseC_-HR-Employee-Attrition.csv"))
glimpse(hr_dataset)

## Rows: 1,470
## Columns: 35
## $ Age <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "…
## $ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
## $ DailyRate <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
## $ Department <chr> "Sales", "Research & Development", "Research …
## $ DistanceFromHome <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
## $ EducationField <chr> "Life Sciences", "Life Sciences", "Other", "L…
## $ EmployeeCount <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EmployeeNumber <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,…
## $ EnvironmentSatisfaction <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
## $ Gender <chr> "Female", "Male", "Male", "Female", "Male", "…
## $ HourlyRate <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
## $ JobInvolvement <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
## $ JobLevel <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole <chr> "Sales Executive", "Research Scientist", "Lab…
## $ JobSatisfaction <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
## $ MaritalStatus <chr> "Single", "Married", "Single", "Married", "Ma…
## $ MonthlyIncome <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ MonthlyRate <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
## $ NumCompaniesWorked <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ Over18 <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
## $ OverTime <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
## $ PercentSalaryHike <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
## $ StandardHours <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ StockOptionLevel <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
## $ YearsAtCompany <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
## $ YearsInCurrentRole <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
## $ YearsSinceLastPromotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
## $ YearsWithCurrManager <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …
I am going to clean the data set, as variable names are in capital letters, some variables are not really necessary, and some variables, e.g., education, are given as a number rather than a more useful description
library(janitor) # clean_names() comes from the janitor package

hr_cleaned <- hr_dataset %>%
clean_names() %>%
mutate(
education = case_when(
education == 1 ~ "Below College",
education == 2 ~ "College",
education == 3 ~ "Bachelor",
education == 4 ~ "Master",
education == 5 ~ "Doctor"
),
environment_satisfaction = case_when(
environment_satisfaction == 1 ~ "Low",
environment_satisfaction == 2 ~ "Medium",
environment_satisfaction == 3 ~ "High",
environment_satisfaction == 4 ~ "Very High"
),
job_satisfaction = case_when(
job_satisfaction == 1 ~ "Low",
job_satisfaction == 2 ~ "Medium",
job_satisfaction == 3 ~ "High",
job_satisfaction == 4 ~ "Very High"
),
performance_rating = case_when(
performance_rating == 1 ~ "Low",
performance_rating == 2 ~ "Good",
performance_rating == 3 ~ "Excellent",
performance_rating == 4 ~ "Outstanding"
),
work_life_balance = case_when(
work_life_balance == 1 ~ "Bad",
work_life_balance == 2 ~ "Good",
work_life_balance == 3 ~ "Better",
work_life_balance == 4 ~ "Best"
)
) %>%
select(age, attrition, daily_rate, department,
distance_from_home, education,
gender, job_role,environment_satisfaction,
job_satisfaction, marital_status,
monthly_income, num_companies_worked, percent_salary_hike,
performance_rating, total_working_years,
work_life_balance, years_at_company,
years_since_last_promotion)

hr_numeric <- hr_dataset%>%
clean_names()%>%
select(age, attrition, daily_rate, department,
distance_from_home, education,
gender, job_role,environment_satisfaction,
job_satisfaction, marital_status,
monthly_income, num_companies_worked, percent_salary_hike,
performance_rating, total_working_years,
work_life_balance, years_at_company,
years_since_last_promotion)

Produce a one-page summary describing this dataset. Here is a non-exhaustive list of questions:
- How often do people leave the company (attrition)?
- How are age, years_at_company, monthly_income and years_since_last_promotion distributed? Can you roughly guess which of these variables is closer to Normal just by looking at summary statistics?
- How are job_satisfaction and work_life_balance distributed? Don’t just report counts, but express categories as % of total
- Plot the distribution of income by education level; consider a theme from ggthemes
- Plot income vs age, faceted by job_role

# job satisfaction by job role
hr_numeric%>%
ggplot(aes(x=job_role, y = job_satisfaction))+
geom_boxplot()+
theme_bw()+
theme(axis.text.x = element_text(angle = 45, hjust = 0.8))+
labs(title = "Distribution of Job Satisfaction by Job Title",
x = "Job Title",
y = "Job Satisfaction")+
NULL

The job satisfaction of Human Resources employees is noticeably lower than that of every other job role, across which job satisfaction is fairly consistent.
attrition_number <- sum(hr_cleaned$attrition == "Yes")
attrition_frequency <- attrition_number / nrow(hr_cleaned)

Looking at the attrition frequency, we can see that roughly 16% of employees have left the company.
skimr::skim(hr_cleaned)

| Name | hr_cleaned |
| Number of rows | 1470 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| character | 10 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| attrition | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| department | 0 | 1 | 5 | 22 | 0 | 3 | 0 |
| education | 0 | 1 | 6 | 13 | 0 | 5 | 0 |
| gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| job_role | 0 | 1 | 7 | 25 | 0 | 9 | 0 |
| environment_satisfaction | 0 | 1 | 3 | 9 | 0 | 4 | 0 |
| job_satisfaction | 0 | 1 | 3 | 9 | 0 | 4 | 0 |
| marital_status | 0 | 1 | 6 | 8 | 0 | 3 | 0 |
| performance_rating | 0 | 1 | 9 | 11 | 0 | 2 | 0 |
| work_life_balance | 0 | 1 | 3 | 6 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 36.92 | 9.14 | 18 | 30 | 36 | 43 | 60 | ▂▇▇▃▂ |
| daily_rate | 0 | 1 | 802.49 | 403.51 | 102 | 465 | 802 | 1157 | 1499 | ▇▇▇▇▇ |
| distance_from_home | 0 | 1 | 9.19 | 8.11 | 1 | 2 | 7 | 14 | 29 | ▇▅▂▂▂ |
| monthly_income | 0 | 1 | 6502.93 | 4707.96 | 1009 | 2911 | 4919 | 8379 | 19999 | ▇▅▂▁▂ |
| num_companies_worked | 0 | 1 | 2.69 | 2.50 | 0 | 1 | 2 | 4 | 9 | ▇▃▂▂▁ |
| percent_salary_hike | 0 | 1 | 15.21 | 3.66 | 11 | 12 | 14 | 18 | 25 | ▇▅▃▂▁ |
| total_working_years | 0 | 1 | 11.28 | 7.78 | 0 | 6 | 10 | 15 | 40 | ▇▇▂▁▁ |
| years_at_company | 0 | 1 | 7.01 | 6.13 | 0 | 3 | 5 | 9 | 40 | ▇▂▁▁▁ |
| years_since_last_promotion | 0 | 1 | 2.19 | 3.22 | 0 | 0 | 1 | 3 | 15 | ▇▁▁▁▁ |
By looking at the summary statistics provided in the data summary, the age variable seems closest to normally distributed: the mean (36.92) and median (36) are very close, and the quartiles are fairly evenly spaced around them.
The years_at_company variable has a right-skewed pattern, since the percentiles are not evenly spaced: the 25th percentile is 3, the median (p50) is 5, and the 75th percentile is 9, but the maximum (p100) jumps to 40.
monthly_income is also right skewed: the minimum is 1009 and the maximum is 19999, while the median (p50) of 4919 sits well below the mean of roughly 6503.
For the years_since_last_promotion variable, the 75th percentile (p75) is 3, meaning 75 percent of employees were last promoted 3 or fewer years ago, but the maximum (p100) is 15, which tells us the data is very right skewed.
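One way to make these eyeball judgments concrete is to compare mean and median: a near-zero gap suggests symmetry, while a mean well above the median suggests right skew. A sketch assuming hr_cleaned from above is in scope (dplyr and tidyr):

```r
library(dplyr)
library(tidyr)

hr_cleaned %>%
  select(age, years_at_company, monthly_income, years_since_last_promotion) %>%
  pivot_longer(everything(), names_to = "variable") %>%
  group_by(variable) %>%
  summarise(mean = mean(value),
            median = median(value),
            skew_gap = mean - median)  # small gap => closer to Normal
```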
library(ggrepel)
library(forcats)
library(formattable)
hr_cleaned %>%
group_by(job_satisfaction) %>%
summarise(count=n()) %>%
mutate(percent_t = count/sum(count),
job_satisfaction_n = case_when(
job_satisfaction == "Low" ~1,
job_satisfaction == "Medium"~2,
job_satisfaction == "High"~3,
job_satisfaction == "Very High"~4
)) %>%
ggplot(aes(x=fct_reorder(job_satisfaction,job_satisfaction_n),y=percent_t))+
geom_col()+
geom_text_repel(aes(label= sprintf("%0.2f %%", 100*percent_t)))+
scale_y_continuous(labels = percent)+
labs(title="Job Satisfaction",
x="Job Satisfaction",
y = "% of employees")+
theme_bw()+
NULL

hr_cleaned %>%
group_by(work_life_balance) %>%
summarise(count= n()) %>%
mutate(percent_t = count/sum(count),
work_life_balance_n = case_when(
work_life_balance == "Bad" ~1,
work_life_balance == "Good" ~2,
work_life_balance == "Better" ~3,
work_life_balance == "Best"~4
)) %>%
ggplot(aes(x=fct_reorder(work_life_balance,work_life_balance_n),y=percent_t))+
geom_col()+
geom_text_repel(aes(label= sprintf("%0.2f %%", 100*percent_t)))+
labs(title="Work Life Balance",
x="Work Life Balance",
y = "% of employees")+
scale_y_continuous(labels = percent)+
theme_bw()+
NULL

“job_satisfaction”: 61% of employees report high or very high satisfaction with their job, whereas around 19% are okay with their jobs. The remaining 20% report low satisfaction. Job satisfaction seems to be left skewed.
“work_life_balance”: This variable is a bit more normally distributed than job_satisfaction, although it is still left skewed. 10% of employees report the best work-life balance, while 5% are not happy with theirs. 60% of employees report a pretty good score, and the remaining 23% seem to be doing okay.
hr_cleaned$education <- factor(hr_cleaned$education,levels=c("Below College","College","Bachelor","Master","Doctor"))
ggplot(hr_cleaned,aes(x = education, y = monthly_income))+
geom_boxplot() +
labs(x = "Education Level",
y = "Monthly Income ",
title = "Boxplot of Monthly Income by Education Level")+
theme_bw()+
NULL

cor(hr_dataset["Education"], hr_dataset["MonthlyIncome"])

## MonthlyIncome
## Education 0.095
Relationship between monthly income and education:
r-value: 0.09496068
After looking at the box plot and our r-value, monthly income and education level have a slight positive correlation: the more educated you are, the more income you tend to earn. Because the r-value is still very close to zero, we would avoid saying they are highly correlated, but the box plots suggest that education level can contribute to monthly income.
gender_income <- hr_dataset %>%
mutate(binary_gender = case_when(Gender=="Male" ~0 ,
Gender == "Female" ~ 1))%>%
select(MonthlyIncome,binary_gender)
cor(gender_income["binary_gender"], hr_dataset["MonthlyIncome"])

## MonthlyIncome
## binary_gender 0.0319
ggplot(hr_dataset,aes(x = Gender, MonthlyIncome))+
geom_boxplot() +
labs(x = "Gender",
y = "Monthly Income ",
title = "Boxplot of Monthly Income by Gender")+
theme_bw()+
NULL

Relationship between monthly income and gender:
r-value: 0.03185849. The correlation is slightly positive in the direction of women (female was coded as 1), but it is close enough to zero to be negligible. Looking at the box plot of monthly income by gender, we can see that females make a little bit more, but the difference is almost negligible.
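The “a little bit more” claim can be quantified with a grouped summary; a sketch assuming hr_cleaned is in scope:

```r
library(dplyr)

# Median and mean monthly income by gender
hr_cleaned %>%
  group_by(gender) %>%
  summarise(median_income = median(monthly_income),
            mean_income = mean(monthly_income),
            n = n())
```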
hr_cleaned %>%
ggplot(aes(x=fct_reorder(job_role,-monthly_income),y=monthly_income))+
geom_boxplot()+
labs(title="Boxplot of Monthly Income Against Job Role", x="Job Role", y="Monthly Income")+
theme(axis.text.x = element_text(angle = 45, hjust = 0.8))+
NULL

mean_education <- hr_cleaned %>%
group_by(education) %>%
summarise(mean=mean(monthly_income))
ggplot(mean_education,aes(x=education,y=mean))+
geom_col()+
labs(title = "Education vs Mean Monthly Income",
x = "Education Level",
y = "Mean Monthly Income ($)")+
theme_bw()+
NULL

median_education <- hr_cleaned %>%
group_by(education) %>%
summarise(median=median(monthly_income))
ggplot(median_education,aes(x=education,y=median))+
geom_col()+
labs(title = "Education vs Median Monthly Income",
x = "Education Level",
y = "Median Monthly Income ($)")+
theme_bw()+
NULL

ggplot(hr_cleaned,aes(monthly_income))+
geom_histogram()+
facet_wrap(~education)+
labs( x = "Monthly Income", y = "Count",
title = "Histograms of Monthly Income by Education Level")+
theme_bw()+
NULL

hr_cleaned %>%
ggplot(aes(y=monthly_income,x=age))+
geom_point()+
facet_wrap(~job_role)+
labs(x="Age", y="Monthly Income",
title= "Scatter Plots of Age vs Income by Job Role")

The purpose of this exercise is to reproduce a plot using your dplyr and ggplot2 skills. Read the article The Racial Factor: There’s 77 Counties Which Are Deep Blue But Also Low-Vaxx. Guess What They Have In Common? and have a look at the attached figure.
You don’t have to worry about the blue-red background, and don’t worry about replicating it exactly; try and see how far you can get. You’re encouraged to work together if you want to and exchange tips/tricks you figured out. Even though the figure in the original article is from early July 2021, you can use the most recent data.
Some hints to get you started:
TEAM NOTE:
Every data point below 90% completeness was dropped, to be in line with the author’s process. The graph would be even closer if the separate datasets used by the author were downloaded as well. See the author’s description of data utilisation.
vax_complete_pop_pct <- vaccinations%>%
filter(completeness_pct>90.0)%>%
group_by(fips)%>%
summarise(series_complete_pop_pct = max(series_complete_pop_pct))
population <- population%>%
mutate(pop = pop_estimate_2019)
vax_complete_pop_pct <- left_join(x = vax_complete_pop_pct,
y = population, by = "fips")%>%
na.omit()
trump_votes <- election2020_results%>%
filter(candidate == "DONALD J TRUMP")%>%
filter(mode == "TOTAL")%>%
group_by(fips)%>%
mutate(prcOfVote = (candidatevotes/totalvotes)*100)
head(trump_votes)

## # A tibble: 6 x 13
## # Groups: fips [6]
## year state state_po county_name fips office candidate party candidatevotes
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2020 ALABA… AL AUTAUGA 01001 PRESID… DONALD J… REPU… 19838
## 2 2020 ALABA… AL BALDWIN 01003 PRESID… DONALD J… REPU… 83544
## 3 2020 ALABA… AL BARBOUR 01005 PRESID… DONALD J… REPU… 5622
## 4 2020 ALABA… AL BIBB 01007 PRESID… DONALD J… REPU… 7525
## 5 2020 ALABA… AL BLOUNT 01009 PRESID… DONALD J… REPU… 24711
## 6 2020 ALABA… AL BULLOCK 01011 PRESID… DONALD J… REPU… 1146
## # … with 4 more variables: totalvotes <dbl>, version <dbl>, mode <chr>,
## # prcOfVote <dbl>
vax_complete_pop_pct <- left_join(x = vax_complete_pop_pct, y = trump_votes, by = "fips") %>%
  na.omit()
library(Hmisc)
ggplot(vax_complete_pop_pct, aes(x = prcOfVote, y = series_complete_pop_pct))+
geom_point(alpha = 0.2, color = "snow4", aes(size = pop))+
annotate("rect", xmin = 45, xmax = Inf, ymin = -Inf, ymax = Inf, fill = "indianred1", alpha = 0.3)+
annotate("rect", xmin = 0, xmax = 55, ymin = -Inf, ymax = Inf, fill = "royalblue1", alpha = 0.3)+
annotate("line", x = seq(0,100), y = 85, lty = 2, color = "blue")+
annotate("text", x = 17, y = 87, label = "Herd Immunity threshold (?)", size = 2, fontface = 4, color = "blue")+
annotate("text", x = 15, y = 15, label = "y = -0.4952x + 73.78\nR\u00B2 = 0.502", size = 2, color = "red", hjust = 0.5, fontface = "bold")+
annotate("line", x = seq(0,100), y = 51.65, lty = 2,color = "blue")+
annotate("text", x = 10, y = 54, label = sprintf("ACTUAL: %0.2f %%",51.65), size =2, fontface = "bold", color = "blue")+
annotate("line", x = seq(0,100), y = 70, lty = 2,color = "blue")+
annotate("text", x = 10, y = 72, label = sprintf("TARGET: %0.2f %%",70), size =2, fontface = "bold", color = "blue")+
annotate("text", x = 40, y = 15, label = "5/09/2021", color = "red", fontface = "bold", size = 2)+
scale_size(range = c(.1, 15), name = "Population (M)")+
geom_point(size= 0.1)+
geom_smooth(method = "lm",
se = FALSE,
lty=5,
color = "blue",
lwd = 0.5)+
ylab("% of Total Population Vaccinated")+
xlab("2020 Trump vote %")+
labs(title = "COVID-19 VACCINATION LEVELS \nOUT OF TOTAL POPULATION BY COUNTY")+
theme_light()+
scale_y_continuous(expand = c(0,0), labels = function(y) paste0(y, "%"), breaks = scales::pretty_breaks(n=20), limits = c(0,100))+
scale_x_continuous(expand = c(0,0), labels = function(x) paste0(x, "%"), breaks = scales::pretty_breaks(n=20), limits = c(0,100))+
theme(aspect.ratio = 20/18,
axis.text.x = element_text(size = 5),
axis.text.y = element_text(size = 5),
axis.title = element_text(size = 10),
      plot.title = element_text(size = 10, face = "bold", hjust = 0.5))
print(summary(lm(series_complete_pop_pct ~ prcOfVote, data = vax_complete_pop_pct)))
##
## Call:
## lm(formula = series_complete_pop_pct ~ prcOfVote, data = vax_complete_pop_pct)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.71 -4.52 0.16 4.76 34.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.7843 0.7827 94.3 <2e-16 ***
## prcOfVote -0.4952 0.0119 -41.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.76 on 1721 degrees of freedom
## Multiple R-squared: 0.502, Adjusted R-squared: 0.502
## F-statistic: 1.74e+03 on 1 and 1721 DF, p-value: <2e-16
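Rather than typing the regression equation into annotate() by hand, the label can be built from the fitted model itself, which keeps the annotation in sync with the data. A minimal sketch with toy data (df and eq_label are illustrative names, not part of the original code):

```r
# Toy data standing in for vax_complete_pop_pct
df <- data.frame(prcOfVote = c(20, 40, 60, 80),
                 series_complete_pop_pct = c(65, 55, 45, 35))

fit <- lm(series_complete_pop_pct ~ prcOfVote, data = df)

# Build the annotation label from the model coefficients and R-squared
eq_label <- sprintf("y = %.4fx + %.2f\nR\u00B2 = %.3f",
                    coef(fit)["prcOfVote"],
                    coef(fit)["(Intercept)"],
                    summary(fit)$r.squared)
```

eq_label can then be passed straight to annotate("text", label = eq_label, ...) in the plot above.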
sprintf("Percentage of the total population vaccinated: %0.2f ",
        sum(vax_complete_pop_pct$series_complete_pop_pct * vax_complete_pop_pct$pop) / sum(vax_complete_pop_pct$pop))
## [1] "Percentage of the total population vaccinated: 51.78 "
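The population-weighted average computed above with sum()/sum() is equivalent to the built-in weighted.mean(). A small sketch with made-up county numbers:

```r
# Hypothetical counties: vaccination percentage and population
pct <- c(40, 60, 80)
pop <- c(100000, 50000, 25000)

# Same result as sum(pct * pop) / sum(pop)
weighted.mean(pct, w = pop)
## [1] 51.42857
```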
The Guardian newspaper has an election poll tracker for the upcoming German election. The list of the opinion polls since Jan 2021 can be found at Wikipedia and your task is to reproduce the graph similar to the one produced by the Guardian.
The following code will scrape the wikipedia page and import the table in a dataframe.
url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election"
# https://www.economist.com/graphic-detail/who-will-succeed-angela-merkel
# https://www.theguardian.com/world/2021/jun/21/german-election-poll-tracker-who-will-be-the-next-chancellor
# get tables that exist on wikipedia page
tables <- url %>%
read_html() %>%
html_nodes(css="table")
# parse HTML tables into a dataframe called polls
# Use purrr::map() to create a list of all tables in the URL
polls <- map(tables, . %>%
html_table(fill=TRUE)%>%
janitor::clean_names())
# list of opinion polls
german_election_polls <- polls[[1]] %>% # the first table on the page contains the list of all opinions polls
  slice(2:(n()-1)) %>% # drop the first row (it repeats the variable names) and the last row (the 2017 election results)
mutate(
# polls are shown to run from-to, e.g. 9-13 Aug 2021. We keep the last date, 13 Aug here, as the poll date
# and we extract it by picking the last 11 characters from that field
end_date = str_sub(fieldwork_date, -11),
# end_date is still a string, so we convert it into a date object using lubridate::dmy()
end_date = dmy(end_date),
# we also get the month and week number from the date, if we want to do analysis by month- week, etc.
month = month(end_date),
week = isoweek(end_date)
  )
library(plotly)
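To see what the end-date extraction above does, here is a minimal, standalone sketch of the str_sub()/dmy() step on a sample fieldwork string (the date value is illustrative):

```r
library(stringr)
library(lubridate)

# Fieldwork dates arrive as a range, e.g. "9-13 Aug 2021".
# The end date is always the last 11 characters of the string.
fieldwork_date <- "9-13 Aug 2021"
end_date <- dmy(str_sub(fieldwork_date, -11))

end_date           # "2021-08-13"
isoweek(end_date)  # 32
```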
german_graph <- ggplot(german_election_polls, aes(x=end_date)) +
geom_smooth(aes(y=union), color="black", size=0.7, span=0.05, se=FALSE)+
geom_smooth(aes(y=spd), color="darkred", size=0.7, span=0.05, se=FALSE)+
geom_smooth(aes(y=grune), color="darkgreen", size=0.7, span=0.105, se=FALSE)+
geom_smooth(aes(y=fdp), color="yellow", size=0.7, span=0.05, se=FALSE)+
geom_smooth(aes(y=af_d), color="darkblue",size=0.7, span=0.05, se=FALSE)+
geom_smooth(aes(y=linke), color="purple", size=0.7, span=0.05, se=FALSE) +
geom_point(aes(y = union), color = "black", size=0.5, alpha=0.3, show.legend = FALSE) +
geom_point(aes(y = spd), color = "darkred", size=0.5, alpha=0.3, show.legend = FALSE) +
geom_point(aes(y = grune), color="darkgreen", size=0.5, alpha=0.3, show.legend = FALSE) +
geom_point(aes(y= fdp), color="yellow", size=0.5, alpha=0.3, show.legend = FALSE) +
geom_point(aes(y= af_d), color="darkblue", size=0.5, alpha=0.3, show.legend = FALSE) +
geom_point(aes(y= linke), color="purple", size=0.5, alpha=0.3, show.legend = FALSE) +
labs(
title = "German Election Poll Tracker",
    subtitle = "Who will be the next Chancellor?",
x = "Date (last updated 3 Sep 2021)",
y = "Percent of Voter Population")+
theme_light()
fig <- ggplotly(german_graph + ylim(5, 45))
fig<- fig%>%
layout(hovermode = "x unified")
fig <- style(fig, hoverinfo = "skip", traces = 7:13)
fig
There is a lot of explanatory text, comments, etc. You do not need these, so delete them and produce a stand-alone document that you could share with someone. Knit the edited and completed R Markdown file as an HTML document (use the "Knit" button at the top of the script editor window) and upload it to Canvas.
Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask others for help, post a question on Slack, and remember that I am here to help too!
As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?
Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.
Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).
Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.